Philosophy

MicrobeR is intended to supplement other packages such as phyloseq and vegan by providing wrapped functions for common analysis. As such, it calls upon these packages frequently behind the scenes. MicrobeR is primarily intended for data visualization and exploration of count-based microbiome data such as 16S rRNA gene sequencing. These functions are intended for data exploration and as such can be wrapped using plotly’s ggplotly to create interactive figures. Additionally, all plotting is carried out by ggplot2 so they can be manipulated directly through the addition of standard ggplot2 functions.

Expected Input Data Format

MicrobeR relies on 4 main types of data:


MicrobeR Functions

Visualization

Microbiome.Barplot.R: Creates a barplot of %-normalized abundances.
Microbiome.Heatmap.R: Creates a heatmap of feature abundances.
PCoA.R: Calculate a distance/dissimilarity metric, carryout PCoA and plot a 2D plot with desired metadata included.
PCoA3D.R: Similar to above but creating a 3D interactive version. This may not work on some windows computers.

Normalization

Make.CLR.R: Carry out a centered-log2-ratio transformation using either a prior or the count zero multiplicative (CZM) method.
Make.Percent.R: Convert table of counts, to a percentage for plotting purposes.
Read.Filter.R: Removes samples that fell below a certain threshold of read depth. This should be used to identify poorly sequenced samples and remove controls from datasets.
Subsample.Table.R: Subsample feature table for metrics such as unweighted UniFrac using a defined randomization seed for reproducibility.
Summarize.Taxa.R: Analogous to QIIME’s summarize_taxa.py. Creates a list of taxonomically summarized versions of the feature table. Useful for plotting and some statistical treatments.
Confidence.Filter.OTUs.R: Removes features which are present in less than X samples with a total of less than Y reads. Useful for removing noisy sparse features from datasets before visualization or statistical analysis.
Filter.OTUs.Fraction.R: Removes features which make up less than X% of dataset for diversity metrics as recommended by Bokulich et al. doi:10.1038/nmeth.2276.

Other

Make.Tree.R: Make a phylogenetic tree for UniFrac metrics. This function relies on an install of QIIME with muscle. See documentation for this function for more information.
Merge.Replicates.R: Merges replicate samples (for example sequencing replicates) by summing reads together.
Nice.Table.R: A wrapper to create an interactive table for data exploration or embedding into markdown document.

Embedded Data

data("MicrobeRTestData"): Creates the following objects based on Bisanz et al. doi: 10.1128/AEM.00780-15. The data was created during a tutorial on microbiome analysis here.
- metadata: Example list of metadata for 20 individuals sampled 2x.
- table: Example SV table of count data.
- taxonomy: Table of taxonomic assignments.
- tree: Phylogenetic tree of SVs.


Example Usage

In these examples, we will be renderering interactive versions of the figures by wrapping all functions in ggplotly(). This is optional but is helpful for data exploration.
We can start by installing the package if you have not already. If installation fails, check which dependency was missing and install manually using install.packages() or bioconductors biocLite().

library(devtools)
install_github("jbisanz/MicrobeR")

Next we can load it and check the version. We will also load plotly which allows for interactive visualizations. For publication figures I would avoid this and save as PDFs.

library(MicrobeR)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
packageVersion("MicrobeR")
## [1] '0.2.0'

We can start by loading the included vaginal microbiome dataset from Bisanz et al. doi: 10.1128/AEM.00780-15.

data("MicrobeRTestData")

Next we can inspect our metadata using the Nice.Table() command. Note that the data can be searched, filtered, sorted, and exported to common file formats.

Nice.Table(metadata)

Lets explore global trends in our data with a PCoA using Bray Curtis. Note that ggplotly creates an interactive version and + ggtitle() has been added to manually change the title. Notice how an ADONIS test is automatically applied. This can be disabled with ADONIS=FALSE.

ggplotly(PCoA(METRIC="braycurtis", OTUTABLE = table, METADATA = metadata, COLOR = "Timepoint") + ggtitle("Exploratory PCoA"))
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## Warning in set.seed(rngseed): '.Random.seed' is not an integer vector but
## of type 'NULL', so ignored
## [1] "...sampled to 6090 reads with 201 taxa"
## [1] "Doing Bray Curtis..."
## [1] "Calculating ADONIS for Timepoint"
## [1] "Timepoint-> ADONIS P=0.001 R2=0.135"

We can also make a 3D version of this. If this, or the figure above do not show, enable webgl in your browser.

PCoA3D(METRIC="braycurtis", OTUTABLE = table, METADATA = metadata, COLOR="Timepoint")
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"
## [1] "Doing Bray Curtis..."

Before going further with visualization. Lets remove some of the noisy features from our dataset. In this case it will need to be in 2 samples with at least 100 reads across all samples.

conf.table<-Confidence.Filter.OTUs(table, 2, 100)
## [1] "Filtering OTUs such that they are present in at least 2  samples with a total of at least  100  reads"
## [1] "...There are 692667 reads and 202  OTUs"
## [1] "...After filtering there are 689287 reads and 117 OTUs"

Now lets create a taxa summarized version of our data for plotting purposes.

summarized.taxa<-Summarize.Taxa(conf.table, taxonomy)
print(paste0("We have ", nrow(summarized.taxa$Genus), " genera in dataset."))
## [1] "We have 47 genera in dataset."

Now that we have summarized taxa, we can plot these in a barplot. For barplots, >10 features generally creates extremely difficult to interpret figures, as such >10 is automatically added to a remainder, but this number can be manually altered. See instructions for more information. In this case I will plot genus level abundances.

ggplotly(Microbiome.Barplot(OTUTABLE = summarized.taxa$Genus, METADATA = metadata, CATEGORY = "Timepoint"))

We can also look at genera for this analysis using a heat map plotting the 50 most abundant genera.

Microbiome.Heatmap(OTUTABLE=summarized.taxa$Genus, METADATA=metadata, ROWCLUSTER = "abundance", CATEGORY="Timepoint")
## [1] "No transform specified, using log10"
## [1] "Rows ranked on average abundance high to low."
## [1] "Minimum abundance: -2"
## [1] "Maximum abundance: 1.99668819247472"

Say we want to do some additional analysis. In this case train a random forest classifier to predict 2nd trimester versus birth. Counts, or even fractional abundances, are often inappropriate due to their compositional nature. In this case we will apply a CLR transformation to help with this. See references in documention for Make.CLR for more information. *Note: tuning of RF parameters would help considerably here as would MANY more samples. This is for example only.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
set.seed(09072017)
clr.table<-Make.CLR(table, CZM = TRUE)
## No. corrected values:  11
trainset<-sample(colnames(clr.table), round(ncol(clr.table)*2/3,0))#create a test and training set using 1/3 and 2/3 respectively.
testset<-colnames(clr.table)[!colnames(clr.table) %in% trainset]


tune<-tuneRF(t(clr.table[,trainset]),factor(metadata[trainset,]$Timepoint), ntreeTry=1000, stepFactor=1.1,improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE)
## mtry = 14  OOB error = 25.93% 
## Searching left ...
## mtry = 13    OOB error = 25.93% 
## 0 0.01 
## Searching right ...
## mtry = 15    OOB error = 29.63% 
## -0.1428571 0.01

trainrf<-randomForest(t(clr.table[,trainset]), factor(metadata[trainset,]$Timepoint), ntree=10000, mtry=13)

We can check our best predictors below.

varImpPlot(trainrf)

Now we can look at the performance.

test.prob = predict(trainrf,type="prob",newdata=t(clr.table[,testset]))[,2]
test.pred = prediction(test.prob, factor(metadata[testset,]$Timepoint))
test.perf = performance(test.pred,"tpr","fpr")
plot(test.perf,main="ROC Curve",col=2,lwd=2)
abline(a=0,b=1,lwd=2,lty=2,col="gray")

AUC=performance(test.pred,"auc") #Calculate the AUC value
print(paste("Area under curve for ROC is", round(AUC@y.values[[1]],3)))
## [1] "Area under curve for ROC is 0.833"